AITopics

Country: Asia > China (0.14)

Genre: Research Report (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Neural Information Processing SystemsApr-24-2026, 10:09:18 GMT

018b59ce1fd616d874afad0f44ba338d-Paper.pdf

artificial intelligence, machine learning, relation, (16 more...)

Country: Asia > India (0.28)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsFeb-17-2026, 15:20:08 GMT

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection - Supplemental Material - 1 More Ablation Studies 1.1 Number of test categories in evaluation

The distillation can also cover 3D object boxes on the background.

artificial intelligence, category, machine learning, (16 more...)

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Security & Privacy (0.94)
Information Technology > Artificial Intelligence > Vision (0.67)

Neural Information Processing SystemsFeb-17-2026, 15:20:05 GMT

e352b765e625934ce86919995e2371aa-Paper-Conference.pdf

category, machine learning, natural language, (18 more...)

Country: Asia > China > Hong Kong (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.73)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Neural Information Processing SystemsFeb-7-2026, 08:05:27 GMT

Landmark-RxR: SolvingVision-and-Language NavigationwithFine-GrainedAlignmentSupervision

In Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately.

machine learning, natural language, trajectory, (18 more...)

Country: Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Akbarian, Fatemeh, Baninajjar, Anahita, Zhang, Yingyi, Balashankar, Ananth, Aminifar, Amir

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

arXiv.org Artificial IntelligenceDec-1-2025

Abstract--Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions [35], where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. T o counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., V ariational Autoencoders (V AEs), to maintain natural alignment. T o further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 46) and 11% (32 43) in unperturbed and perturbed input settings respectively, providing an effective and model-agnostic defense against adversarial illusions. Multi-modal foundation models have rapidly advanced the frontier of visual and linguistic understanding. Foundation models such as CLIP [19], ALIGN [11], and ImageBind [8] align a variety of heterogeneous modalities including images, text, and other modalities within a shared embedding space, thereby enabling zero-shot classification, cross-modal retrieval, and generative conditioning. The shared embedding space that underpins cross-modal flexibility simultaneously introduces a new attack surface, giving rise to adversarial illusions [35]. As downstream tasks directly rely on the integrity of this shared representation, even small perturbations in one modality can induce semantic misalignment across others, misleading models that depend on the embedding for retrieval, captioning, or generative conditioning. Defending against such cross-modal attacks presents unique challenges.

accuracy, machine learning, natural language, (17 more...)

2511.21893

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Artificial IntelligenceNov-27-2025

Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

You, Xiaoxing, Huang, Qiang, Li, Lingyu, Zhang, Chi, Liu, Xiaopeng, Zhang, Min, Yu, Jun

News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

large language model, machine learning, merge, (22 more...)

2511.21002

Country: Asia > China (0.68)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.93)
Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kumar, Amit, Kaur, Maninder, Mall, Raghvendra, Gupta, Sukrit

CASPER: Cross-modal Alignment of Spatial and single-cell Profiles for Expression Recovery

arXiv.org Artificial IntelligenceNov-20-2025

Spatial Transcriptomics enables mapping of gene expression within its native tissue context, but current platforms measure only a limited set of genes due to experimental constraints and excessive costs. To overcome this, computational models integrate Single-Cell RNA Sequencing data with Spatial Transcriptomics to predict unmeasured genes. We propose CASPER, a cross-attention based framework that predicts unmeasured gene expression in Spatial Transcriptomics by leveraging centroid-level representations from Single-Cell RNA Sequencing. We performed rigorous testing over four state-of-the-art Spatial Transcriptomics/Single-Cell RNA Sequencing dataset pairs across four existing baseline models. CASPER shows significant improvement in nine out of the twelve metrics for our experiments. This work paves the way for further work in Spatial Transcriptomics to Single-Cell RNA Sequencing modality translation. The code for CASPER is available at https://github.com/AI4Med-Lab/CASPER.

artificial intelligence, gene expression, machine learning, (18 more...)

2511.15139

Country: Asia > Middle East > Qatar (0.29)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

arXiv.org Artificial IntelligenceNov-12-2025

Text-based Aerial-Ground Person Retrieval

Zhou, Xinyu, Wu, Yu, Ma, Jiayao, Wang, Wenhao, Cao, Min, Ye, Mang

This work introduces Text-based Aerial-Ground Person Retrieval (T AG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T -PR), which focuses solely on ground-view images, T AG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) T AG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) T AG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint de-coupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of T AG-CLIP on both the proposed T AG-PEDES dataset and existing T -PR benchmarks. The dataset and code are available at https://github.com/Flame-Chasers/T

artificial intelligence, machine learning, natural language, (20 more...)

2511.08369

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.47)
(2 more...)

arXiv.org Artificial IntelligenceNov-11-2025

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Wang, Zesen, Lan, Lijuan, Li, Yonggang

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance LLM' comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

forecasting, large language model, machine learning, (17 more...)

2506.17631

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)